The goal of this tab is to gain a deeper understanding of the data and how the variables behave, and to identify anomalies, patterns, or potential insights for future exploration.
Vehicle Production by Country
Here, we will explore the vehicle_production_countries data set.
Code
library(tidyverse)
library(ggplot2)
library(boot)
library(AICcmodavg)
library(readxl)
library(broom)

vehicle_production_countries_comercial <- read_excel("../../data/Raw_Data_project_Pub.Transport_5000/vehicle_production_countries.xlsx", sheet = "Comercial-vehicles", skip = 1)

# Replace "N", "U", and "Z" with NA in all columns except the first one
vehicle_production_countries_comercial[-1] <- lapply(vehicle_production_countries_comercial[-1], function(x) ifelse(x %in% c("N", "U", "Z"), NA, x))
vehicle_production_countries_comercial <- vehicle_production_countries_comercial %>% mutate(across(-1, as.numeric))
colnames(vehicle_production_countries_comercial)[colnames(vehicle_production_countries_comercial) == "...1"] <- "Country"
colnames(vehicle_production_countries_comercial)

# Transpose the data frame
vehicle_production_countries_comercial_transposed <- as.data.frame(t(vehicle_production_countries_comercial))
# Set the first row as the column names
colnames(vehicle_production_countries_comercial_transposed) <- vehicle_production_countries_comercial_transposed[1, ]
# Remove the first row
vehicle_production_countries_comercial_transposed <- vehicle_production_countries_comercial_transposed[-1, ]
# Convert the year columns to numeric
vehicle_production_countries_comercial_transposed[] <- lapply(vehicle_production_countries_comercial_transposed, as.numeric)
# Drop the "Total world" aggregate column
vehicle_production_countries_comercial_transposed <- vehicle_production_countries_comercial_transposed[, !colnames(vehicle_production_countries_comercial_transposed) %in% "Total world"]

df <- vehicle_production_countries_comercial_transposed %>% rownames_to_column(var = "Year")
# Replace "(R) 2019" with 2019 and "(R) 2020" with 2020
df <- df %>% mutate(Year = ifelse(Year == "(R) 2019", 2019, ifelse(Year == "(R) 2020", 2020, Year)))
Code
# Pivot the data to long format
df_long <- df %>% pivot_longer(-Year, names_to = "Country", values_to = "Value")

# Plot
ggplot(df_long, aes(x = Year, y = Value, color = Country)) +
  geom_point() +
  labs(
    title = "Value per year and per country in commercial vehicle production",
    x = "Year",
    y = "Value of commercial vehicles produced"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom", axis.text.x = element_text(angle = 90, hjust = 1))
From this plot we can see that Japan, back in the 1960s, was one of the biggest producers, but it no longer is. The US led from the 1990s until around 2008, when it declined, which was expected given the recession. It is impressive, however, that China managed to grow through the recession in this market at an almost exponential rate, and it has kept increasing since then.
In order to run an ANOVA test and check whether countries, years, and the interaction between the two have a statistically significant effect on commercial vehicle production, we are going to assume that the missing values are not relevant to the overall market and are therefore close to 0. Thus, for the sake of this process, we will set them to 0.
Code
# Replace NA values with 0
vehicle_production_countries_comercial <- replace(vehicle_production_countries_comercial, is.na(vehicle_production_countries_comercial), 0)
Code
data_long <- vehicle_production_countries_comercial %>%
  pivot_longer(cols = -Country, names_to = "Year", values_to = "Value")
data_long <- data_long[data_long$Country != "Total world", ]

# Perform ANOVA with main effects and their interaction
model <- aov(Value ~ Year * Country, data = data_long)
# Summarize the ANOVA results
summary_model <- summary(model)
# Print the ANOVA table with F-tests
print(anova(model, test = "F"))
Warning message in anova.lm(object):
"ANOVA F-tests on an essentially perfect fit are unreliable"
Analysis of Variance Table
Response: Value
Df Sum Sq Mean Sq F value Pr(>F)
Year 31 85531235 2759072 NaN NaN
Country 32 2172668273 67895884 NaN NaN
Year:Country 992 1139993857 1149187 NaN NaN
Residuals 0 0 NaN
Seeing this, we will simplify our model by removing the interaction term (so as not to overfit the data with our ANOVA model).
Code
model <- aov(Value ~ Year + Country, data = data_long)
# Summarize the ANOVA results
summary_model <- summary(model)
# Print the ANOVA table with F-tests
print(anova(model, test = "F"))
Analysis of Variance Table
Response: Value
Df Sum Sq Mean Sq F value Pr(>F)
Year 31 85531235 2759072 2.4009 3.244e-05 ***
Country 32 2172668273 67895884 59.0816 < 2.2e-16 ***
Residuals 992 1139993857 1149187
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both factors have a significant effect (which was expected). This also confirms why the interaction overfits the model here: with a single observation per Year-Country cell, the interaction term leaves no residual degrees of freedom to test against.
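The "essentially perfect fit" warning can be confirmed with a quick degrees-of-freedom count. The sketch below (a small Python calculation, with cell counts read off the ANOVA tables above) shows why the interaction model has nothing left over for the residuals:

```python
# Degrees-of-freedom bookkeeping for the two ANOVA models above.
# With one observation per Year-Country cell, the interaction model
# uses up every degree of freedom, leaving none for the residuals.
n_years, n_countries = 32, 33          # Year df = 31, Country df = 32 above
n_obs = n_years * n_countries          # one value per cell

# Year * Country: both main effects plus the interaction
df_full = (n_years - 1) + (n_countries - 1) + (n_years - 1) * (n_countries - 1)
resid_df_full = n_obs - 1 - df_full
print(resid_df_full)      # 0 -> a perfect fit, so the F-tests come out NaN

# Year + Country: main effects only
df_additive = (n_years - 1) + (n_countries - 1)
resid_df_additive = n_obs - 1 - df_additive
print(resid_df_additive)  # 992, matching the Residuals row above
```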
Energy and Usage of Buses
In this part, we will be exploring the energy_consumed_byMill_passenger_MILES file.
Code
energy_consum <- read_excel("../../data/Raw_Data_project_Pub.Transport_5000/energy_consumed_byMill_passenger_MILES.xlsx", sheet = "Energy")
colnames(energy_consum) <- energy_consum[1, ]
# Remove the first row
energy_consum <- energy_consum[-1, ]
# Rename the first column to "Data_to_Explore"
colnames(energy_consum)[1] <- "Data_to_Explore"

# Keep only the rows of interest
energy_consum <- energy_consum[energy_consum$Data_to_Explore %in% c(
  "Vehicle-miles (millions)",
  "Passenger-miles (millions)",
  "Energy consumed, total (billion Btu)",
  "Energy intensity (Btu/passenger-mile)"
), ]

# Replace all "N" values (except in the first column) with NA
energy_consum[, -1] <- apply(energy_consum[, -1], 2, function(x) ifelse(x == "N", NA, x))
# Drop these columns because of numerous NAs
energy_consum <- energy_consum %>% select(-"1960", -"1965", -"1970", -"1975", -"1980", -"1985", -"1990", -"1991", -"1992", -"1993", -"1994", -"1995")
# Change columns from character to numeric (except the first column)
energy_consum <- energy_consum %>% mutate_at(vars(-1), as.numeric)

# Transpose so that years become rows
energy_consum_long <- as.data.frame(t(energy_consum))
colnames(energy_consum_long) <- energy_consum_long[1, ]
# Remove the first row
energy_consum_long <- energy_consum_long[-1, ]
colnames(energy_consum_long) <- c("Vehicle_milesMill", "Passenger_miles_mill", "Energy_consumed_total_bill_Btu", "Energy_intensity_Btu_passenger_mile")
Years <- c(1996:2021)
energy_consum_long['Years'] <- Years

# Data frame used for the models (2020-2021 dropped as COVID-19 outliers)
energy_consum_long_no_Covid <- energy_consum_long[!(energy_consum_long$Years %in% c(2020, 2021)), ]
For the fitted models in the following plots, the years 2020 and 2021 are treated as heavy outliers caused by COVID-19 and are excluded.
Code
# Fit a linear model (2020-2021 excluded)
lm_model <- lm(Energy_intensity_Btu_passenger_mile ~ Years, data = energy_consum_long_no_Covid)

# Create the plot
plot(energy_consum_long_no_Covid$Years, energy_consum_long_no_Covid$Energy_intensity_Btu_passenger_mile,
     xlab = "Year",
     ylab = "Energy intensity (Btu/passenger-mile)",
     col = "blue3", pch = 19)

# Calculate the confidence interval of the fitted line
conf_int <- predict(lm_model, interval = "confidence")

# Shade the area between the confidence interval bounds first,
# so it does not cover the lines drawn afterwards
polygon(c(energy_consum_long_no_Covid$Years, rev(energy_consum_long_no_Covid$Years)),
        c(conf_int[, "lwr"], rev(conf_int[, "upr"])),
        col = "lightgray", border = NA)
# Add the confidence interval lines
lines(energy_consum_long_no_Covid$Years, conf_int[, "lwr"], col = "red", lty = 2)
lines(energy_consum_long_no_Covid$Years, conf_int[, "upr"], col = "red", lty = 2)
# Add the trend line
abline(lm_model, col = "red")
# Re-draw the points on top of the shading
points(energy_consum_long_no_Covid$Years, energy_consum_long_no_Covid$Energy_intensity_Btu_passenger_mile, col = "blue3", pch = 19)
From this plot we can see that the energy intensity (Btu per passenger-mile) has been steadily decreasing. However, is that because our vehicles are more efficient, or because of cuts in public transportation?
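As an aside, the confidence band that `predict(lm_model, interval = "confidence")` draws can be reproduced by hand from the usual OLS formulas. The sketch below uses hypothetical values in place of the energy-intensity series, so the numbers are illustrative only:

```python
# Hand-computed 95% confidence band around a fitted line,
# mirroring R's predict(lm_model, interval = "confidence").
# The y-values are hypothetical stand-ins for the energy-intensity data.
import numpy as np
from scipy import stats

x = np.arange(1996, 2020, dtype=float)           # 2020-2021 excluded, as above
rng = np.random.default_rng(1)
y = 40000 - 250 * (x - 1996) + rng.normal(0, 800, x.size)  # hypothetical series

n = x.size
slope, intercept = np.polyfit(x, y, 1)           # least-squares fit
y_hat = intercept + slope * x

# Residual standard error on n - 2 degrees of freedom
s = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))
# Standard error of the fitted mean at each x
se_fit = s * np.sqrt(1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))

t_crit = stats.t.ppf(0.975, n - 2)
lwr, upr = y_hat - t_crit * se_fit, y_hat + t_crit * se_fit
```

Note that the band is narrowest at the mean year and flares out toward the ends, which is exactly the shape the shaded region shows in the plots above.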
Code
# Fit a linear model (2020-2021 excluded)
lm_model <- lm(Passenger_miles_mill ~ Years, data = energy_consum_long_no_Covid)

# Create the plot
plot(energy_consum_long_no_Covid$Years, energy_consum_long_no_Covid$Passenger_miles_mill,
     xlab = "Year",
     ylab = "Passenger-miles (millions)",
     col = "blue3", pch = 19)

# Calculate the confidence interval of the fitted line
conf_int <- predict(lm_model, interval = "confidence")

# Shade the area between the confidence interval bounds
polygon(c(energy_consum_long_no_Covid$Years, rev(energy_consum_long_no_Covid$Years)),
        c(conf_int[, "lwr"], rev(conf_int[, "upr"])),
        col = "lightgray", border = NA)
# Add the confidence interval lines
lines(energy_consum_long_no_Covid$Years, conf_int[, "lwr"], col = "red", lty = 2)
lines(energy_consum_long_no_Covid$Years, conf_int[, "upr"], col = "red", lty = 2)
# Add the trend line
abline(lm_model, col = "red")
# Re-draw the points on top of the shading
points(energy_consum_long_no_Covid$Years, energy_consum_long_no_Covid$Passenger_miles_mill, col = "blue3", pch = 19)
While the fitted model shows an overall increase in the use of public transportation, we can see that ridership has been dropping steadily since around 2014. This raises several questions: Is it an investment problem? Are there not enough incentives to use public transportation? Do people simply own more cars?
Code
# Fit a linear model (2020-2021 excluded)
lm_model <- lm(Vehicle_milesMill ~ Years, data = energy_consum_long_no_Covid)

# Create the plot
plot(energy_consum_long_no_Covid$Years, energy_consum_long_no_Covid$Vehicle_milesMill,
     xlab = "Year",
     ylab = "Vehicle-miles (millions)",
     col = "blue3", pch = 19)

# Calculate the confidence interval of the fitted line
conf_int <- predict(lm_model, interval = "confidence")

# Shade the area between the confidence interval bounds
polygon(c(energy_consum_long_no_Covid$Years, rev(energy_consum_long_no_Covid$Years)),
        c(conf_int[, "lwr"], rev(conf_int[, "upr"])),
        col = "lightgray", border = NA)
# Add the confidence interval lines
lines(energy_consum_long_no_Covid$Years, conf_int[, "lwr"], col = "red", lty = 2)
lines(energy_consum_long_no_Covid$Years, conf_int[, "upr"], col = "red", lty = 2)
# Add the trend line
abline(lm_model, col = "red")
# Re-draw the points on top of the shading
points(energy_consum_long_no_Covid$Years, energy_consum_long_no_Covid$Vehicle_milesMill, col = "blue3", pch = 19)
Vehicle-miles have been steadily increasing, which suggests that more routes and service have been added over time. Why, then, has this not been enough to increase the demand for public transportation?
DC Metro Scorecard
This part will focus on the DC_Metro_Scorecard data, which tracks the reliability and efficiency of the DC Metro from 2014 to 2016.
First, we are going to test whether the year and the month have a significant effect on the values in our data. This will help us understand whether service has improved over the years, and whether certain months affect public transportation outcomes due to weather or other circumstances.
Code
Bus_on_time <- DC_metro %>% select('Year', 'Month', 'Bus_on_time')
model <- aov(Bus_on_time ~ Year + Month, data = Bus_on_time)
# Summarize the ANOVA results
summary_model <- summary(model)
# Print the ANOVA table with F-tests
print(anova(model, test = "F"))
Analysis of Variance Table
Response: Bus_on_time
Df Sum Sq Mean Sq F value Pr(>F)
Year 2 0.0012562 0.00062809 7.6584 0.0031782 **
Month 12 0.0056355 0.00046963 5.7263 0.0002625 ***
Residuals 21 0.0017223 0.00008201
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Clearly year and month seem to have a significant effect on the buses being on time.
Code
Bus_fleet_reliability <- DC_metro %>% select('Year', 'Month', 'Bus_fleet_reliability')
model <- aov(Bus_fleet_reliability ~ Year + Month, data = Bus_fleet_reliability)
# Summarize the ANOVA results
summary_model <- summary(model)
# Print the ANOVA table with F-tests
print(anova(model, test = "F"))
Analysis of Variance Table
Response: Bus_fleet_reliability
Df Sum Sq Mean Sq F value Pr(>F)
Year 2 5604536 2802268 9.3323 0.001259 **
Month 12 9091418 757618 2.5231 0.030574 *
Residuals 21 6305777 300275
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Clearly year and month seem to have a significant effect on the bus fleet reliability (miles travelled before a breakdown).
Code
Rail_fleet_reliability <- DC_metro %>% select('Year', 'Month', 'Rail_fleet_reliability')
model <- aov(Rail_fleet_reliability ~ Year + Month, data = Rail_fleet_reliability)
# Summarize the ANOVA results
summary_model <- summary(model)
# Print the ANOVA table with F-tests
print(anova(model, test = "F"))
Analysis of Variance Table
Response: Rail_fleet_reliability
Df Sum Sq Mean Sq F value Pr(>F)
Year 2 536503502 268251751 2.8728 0.07892 .
Month 12 2480920803 206743400 2.2141 0.05343 .
Residuals 21 1960928969 93377570
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this case, year and month don’t seem to have a significant effect on rail transportation fleet reliability (miles travelled before a breakdown).
Code
Rail_on_time <- DC_metro %>% select('Year', 'Month', 'Rail_on_time')
model <- aov(Rail_on_time ~ Year + Month, data = Rail_on_time)
# Summarize the ANOVA results
summary_model <- summary(model)
# Print the ANOVA table with F-tests
print(anova(model, test = "F"))
Analysis of Variance Table
Response: Rail_on_time
Df Sum Sq Mean Sq F value Pr(>F)
Year 2 0.060978 0.0304888 52.2126 7.084e-09 ***
Month 12 0.020835 0.0017363 2.9734 0.01397 *
Residuals 21 0.012263 0.0005839
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Clearly year and month seem to have a significant effect on the rail transportation being on time.
Code
Customer_injury_rate_per_1_Mill <- DC_metro %>% select('Year', 'Month', 'Customer_injury_rate_per_1_Mill')
model <- aov(Customer_injury_rate_per_1_Mill ~ Year + Month, data = Customer_injury_rate_per_1_Mill)
# Summarize the ANOVA results
summary_model <- summary(model)
# Print the ANOVA table with F-tests
print(anova(model, test = "F"))
Analysis of Variance Table
Response: Customer_injury_rate_per_1_Mill
Df Sum Sq Mean Sq F value Pr(>F)
Year 2 0.1121 0.05606 0.1770 0.83903
Month 12 11.8039 0.98366 3.1056 0.01119 *
Residuals 21 6.6515 0.31674
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Only the month seems to have a significant effect on the customer injury rate (which is expected, given weather). However, if year does not, have we taken the necessary actions to improve safety on public transportation?
Code
Employee_injury_rate_per_200k_h <- DC_metro %>% select('Year', 'Month', 'Employee_injury_rate_per_200k_h')
model <- aov(Employee_injury_rate_per_200k_h ~ Year + Month, data = Employee_injury_rate_per_200k_h)
# Summarize the ANOVA results
summary_model <- summary(model)
# Print the ANOVA table with F-tests
print(anova(model, test = "F"))
Analysis of Variance Table
Response: Employee_injury_rate_per_200k_h
Df Sum Sq Mean Sq F value Pr(>F)
Year 2 11.822 5.9109 9.0165 0.00149 **
Month 12 12.759 1.0632 1.6218 0.16025
Residuals 21 13.767 0.6556
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this case, year has a significant effect on employee injury rates while month does not, so we can assume that some measures affecting employees have changed over the years.
Code
Crimes_per_1_Mill_passengers <- DC_metro %>% select('Year', 'Month', 'Crimes_per_1_Mill_passengers')
model <- aov(Crimes_per_1_Mill_passengers ~ Year + Month, data = Crimes_per_1_Mill_passengers)
# Summarize the ANOVA results
summary_model <- summary(model)
# Print the ANOVA table with F-tests
print(anova(model, test = "F"))
Analysis of Variance Table
Response: Crimes_per_1_Mill_passengers
Df Sum Sq Mean Sq F value Pr(>F)
Year 1 0.2313 0.23133 0.6465 0.44209
Month 12 14.6777 1.22314 3.4182 0.03656 *
Residuals 9 3.2205 0.35783
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Only month seems to have a significant effect on crime rates on public transportation. We could use this information to decide in which months of the year to increase security, and to ask whether the effect is driven by holidays or other major events.
Code
# Prepare a column in Date format in order to graph the data properly
DC_metro <- DC_metro %>% mutate_at("Year", str_replace, "CY ", "")
head(DC_metro)
DC_metro$Year <- as.numeric(DC_metro$Year)

# Recode month abbreviations as two-digit numbers
DC_metro <- DC_metro %>% mutate_at("Month", str_replace, "Jan", "01")
DC_metro <- DC_metro %>% mutate_at("Month", str_replace, "Feb", "02")
DC_metro <- DC_metro %>% mutate_at("Month", str_replace, "Mar", "03")
DC_metro <- DC_metro %>% mutate_at("Month", str_replace, "Apr", "04")
DC_metro <- DC_metro %>% mutate_at("Month", str_replace, "May", "05")
DC_metro <- DC_metro %>% mutate_at("Month", str_replace, "Jun", "06")
DC_metro <- DC_metro %>% mutate_at("Month", str_replace, "Jul", "07")
DC_metro <- DC_metro %>% mutate_at("Month", str_replace, "Aug", "08")
DC_metro <- DC_metro %>% mutate_at("Month", str_replace, "Sep", "09")
DC_metro <- DC_metro %>% mutate_at("Month", str_replace, "Oct", "10")
DC_metro <- DC_metro %>% mutate_at("Month", str_replace, "Nov", "11")
DC_metro <- DC_metro %>% mutate_at("Month", str_replace, "Dec", "12")
DC_metro <- DC_metro[order(DC_metro$Year, DC_metro$Month), ]

# Drop the YTD summary rows
DC_metro <- subset(DC_metro, DC_metro$Month != 'YTD')
head(DC_metro)

library(zoo)
DC_metro$Date <- as.yearmon(paste(DC_metro$Month, DC_metro$Year, sep = " "), format = "%m %Y")
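The same date-preparation step can be sketched in pandas, using a hypothetical miniature of the DC_metro table with the same "CY 2014" / "Jan" / "YTD" conventions; pandas can parse the month abbreviations directly, so the twelve-way recoding is not needed:

```python
import pandas as pd

# Hypothetical miniature of the DC_metro table
dc = pd.DataFrame({
    "Year":  ["CY 2014", "CY 2014", "CY 2014", "CY 2015"],
    "Month": ["Jan", "Feb", "YTD", "Jan"],
})

# Drop the YTD summary rows, as in the R code above
dc = dc[dc["Month"] != "YTD"].copy()
# Strip the "CY " prefix and build a proper datetime from month + year
dc["Year"] = dc["Year"].str.replace("CY ", "", regex=False).astype(int)
dc["Date"] = pd.to_datetime(dc["Month"] + " " + dc["Year"].astype(str), format="%b %Y")
dc = dc.sort_values("Date").reset_index(drop=True)

print(dc["Date"].dt.strftime("%Y-%m").tolist())  # ['2014-01', '2014-02', '2015-01']
```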
While the spread of the points is very wide, it is concerning to see a trend in which buses appear to be increasingly delayed over time. That suggests we are not taking the necessary steps to improve on-time performance.
Rail transportation clearly seems to have been getting worse over time. This problem is concerning and should be tackled as soon as possible. Is it due to safety issues, investment problems, or poor planning?
Employee injuries have been increasing over time, so if the trend continues we should probably invest more in safety and take additional precautions for employees.
Crime rates have also been increasing over time; if that trend continues, we should probably invest in more security on public transportation.
Text Data
In order to see which concerns about public transportation matter most to users, we are going to explore what people mention most in their Reddit posts about public transportation, using a word cloud.
Code
import re
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import shutil
from sklearn.feature_extraction.text import CountVectorizer

# Read in the data
df = pd.read_json("../../data/Raw_Data_project_Pub.Transport_5000/Reddit_sentiment_data/sentiment_results.json")
print(df.shape)
print(df.columns)

# Throw away all empty text strings, along with their url and sentiment_score
df = df[df.text != '']
print(df.shape)

texts = []

# https://spacy.io/usage/models
parser = spacy.load('en_core_web_sm')
stop_words = STOP_WORDS

# Iterate over rows
for i in range(df.shape[0]):
    # Quickly clean the text: drop HTML breaks, keep lowercase letters and spaces
    text_value = df["text"].iloc[i]
    text_value = re.sub('[^a-zA-Z ]+', '', text_value.replace("<br />", "").lower())
    doc = parser(text_value)
    # Lemmatize and drop stop words (compare the lowercase token string,
    # not the spaCy Token object, against the stop-word set)
    tokens = [token.lemma_ for token in doc if token.lower_ not in stop_words]
    texts.append(" ".join(tokens))

wordcloud_text = " ".join(texts)
Code
def generate_word_cloud(my_text):
    from wordcloud import WordCloud, STOPWORDS
    import matplotlib.pyplot as plt

    def plot_cloud(wordcloud):
        # Set figure size
        plt.figure(figsize=(40, 30))
        # Display image
        plt.imshow(wordcloud)
        # No axis details
        plt.axis("off")

    # Generate word cloud
    wordcloud = WordCloud(width=3000, height=2000, random_state=1,
                          background_color='salmon', colormap='Pastel1',
                          collocations=False, stopwords=STOPWORDS).generate(my_text)
    plot_cloud(wordcloud)
    plt.show()

# Filler words to remove from the result
words_to_remove = {"one", "go", "even", "give", "will", "need", "say", "well",
                   "still", "make", "think", "look", "etc", "actually", "yet", "put"}

# Remove whole tokens only; a plain str.replace() would also delete these
# strings inside longer words (e.g. "one" inside "money")
wordcloud_text = " ".join(w for w in wordcloud_text.split() if w not in words_to_remove)

generate_word_cloud(wordcloud_text)
From this word cloud, we can see which topics people discuss most regarding public transportation. This can help us see what matters most to them and focus on those topics.